On Power-law Kernels, corresponding Reproducing 
Kernel Hilbert Space and Applications 



Debarghya Ghoshdastidar, Ambedkar Dukkipati 
Department of Computer Science and Automation 
Indian Institute of Science, Bangalore - 560012. 
email: {debarghya.g, ad}@csa.iisc.ernet.in



Abstract — The role of kernels is central to machine learning. 
Motivated by the importance of power-law distributions in 
statistical modeling, in this paper, we propose the notion of
power-law kernels to investigate power-laws in learning problems. We
propose two power-law kernels by generalizing Gaussian and 
Laplacian kernels. This generalization is based on distributions, 
arising out of maximization of a generalized information mea- 
sure known as nonextensive entropy that is very well studied 
in statistical mechanics. We prove that the proposed kernels 
are positive definite, and provide some insights regarding the 
corresponding Reproducing Kernel Hilbert Space (RKHS). We 
also study practical significance of both kernels in classification 
and regression, and present some simulation results. 

I. Introduction 

The notion of 'power-law' distributions is not recent; they
first arose in economics in the studies of Pareto [20] a
hundred years ago. Later, power-law behavior was observed
in various fields such as physics, biology and computer science
[13], [4], and hence the phrase "ubiquitous power-laws".
Though the term was first coined for distributions with a
negative constant exponent, i.e., $f(x) \propto x^{-\alpha}$, the meaning of
the term has expanded in due course of time to include various
fat-tailed distributions, i.e., distributions decaying at a slower
rate than the Gaussian distribution. This class is also referred to
as generalized Pareto distributions.

On the other hand, though generalizations of information
measures were proposed soon after the birth of information
theory, their connections with power-law distributions have been
established only relatively recently. While maximization of
Shannon entropy gives rise to exponential distributions, these
generalized measures give power-law distributions. This led to a
dramatic increase in interest in generalized information measures
and their application to statistics.

Indeed, the starting point of the theory of generalized
measures of information is due to Alfred Rényi [22]. Another
generalization was introduced by Havrda and Charvát [14],
and then studied by Tsallis [29] in statistical mechanics; it
is known as Tsallis entropy or nonextensive entropy. Tsallis
entropy involves a parameter $q$, and it retrieves Shannon
entropy as $q \to 1$. The Shannon-Khinchin axioms of Shannon
entropy have been generalized to this case [28], and this
entropy functional has been studied in information theory,
statistics and many other fields. Tsallis entropy has been used
to study power-law behavior in different cases like finance,
earthquakes and network traffic [23], [1], [2].

In kernel based machine learning, positive definite kernels
are considered as a measure of similarity between points [24].
The choice of kernel is critical to the performance of the
learning algorithms, and hence, many kernels have been studied in
the literature [8]. Kernels based on information theoretic quantities
are also commonly used in text mining and image processing [15].
However, such kernels are defined on probability
measures. Probability kernels based on Tsallis entropy have
also been studied in [19].

In this work, we are interested in kernels based on maximum 
entropy distributions. It turns out that Gaussian, Laplacian, 
Cauchy kernels, which have been extensively studied in ma- 
chine learning, have corresponding distributions, which are 
maximum entropy distributions. This motivates us to look into
kernels that correspond to maximum Tsallis entropy distributions,
also termed Tsallis distributions. These distributions
have inherent advantages as they are generalizations of
exponential distributions, and they exhibit power-law nature [23],
[11]. In fact, the value of $q$ controls the nature of the
power-law tails.

In this paper, we propose a new kernel based on q-Gaussian 
distribution, which is a generalization of Gaussian, obtained 
by maximizing Tsallis entropy under certain moment con- 
straints. Further, we introduce a generalization of the Laplace 
distribution following the same lines, and propose a similar 
q-Laplacian kernel. We give some insights into reproducing 
kernel Hilbert spaces (RKHS) of these kernels. We prove that 
the proposed kernels are positive definite over a range of values 
of q. We demonstrate the effect of these kernels by applying
them to machine learning tasks: classification and regression 
by SVMs. We provide results indicating that in some cases, 
the proposed kernels perform better than their counterparts 
(Gaussian and Laplacian kernels) for certain values of q. 

II. Tsallis distributions 

Tsallis entropy can be obtained by generalizing the
information of a single event in the definition of Shannon entropy
as shown in [29], where the natural logarithm is replaced with the
q-logarithm, defined as $\ln_q x = \frac{x^{1-q} - 1}{1-q}$, $q \in \mathbb{R}$, $q > 0$, $q \neq 1$.

Tsallis entropy in the continuous case is defined as [9]

$$H_q(p) = \frac{1 - \int_{\mathbb{R}} (p(x))^q \, dx}{q - 1}, \qquad q \in \mathbb{R},\ q > 0,\ q \neq 1. \tag{1}$$

This function retrieves the differential Shannon entropy
functional as $q \to 1$. It is called nonextensive because of its
pseudo-additive nature [29].

Kullback's minimum discrimination theorem [18] establishes
connections between statistics and information theory.
A special case is Jaynes' maximum entropy principle [17],
by which exponential distributions can be obtained by
maximizing the Shannon entropy functional, subject to some moment
constraints. Using the same principle, maximizing Tsallis
entropy under the following constraint

$$\text{q-mean} \quad \langle x \rangle_q := \frac{\int_{\mathbb{R}} x \, (p(x))^q \, dx}{\int_{\mathbb{R}} (p(x))^q \, dx} = \mu, \tag{2}$$

results in a distribution known as the q-exponential
distribution [30], which is of the form

$$p(x) = \frac{1}{\mu} \exp_q\left(-\frac{x}{(2-q)\mu}\right), \tag{3}$$

where the q-exponential, $\exp_q(z)$, is expressed as

$$\exp_q(z) = \big(1 + (1-q)z\big)_+^{\frac{1}{1-q}}. \tag{4}$$

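The condition $y_+ = \max(y, 0)$ in (4) is called the Tsallis
cut-off condition, which ensures existence of the q-exponential. If a
constraint based on the second moment,

$$\text{q-variance} \quad \langle x^2 \rangle_q := \frac{\int_{\mathbb{R}} (x-\mu)^2 \, (p(x))^q \, dx}{\int_{\mathbb{R}} (p(x))^q \, dx} = \sigma^2, \tag{5}$$

is considered along with (2), one obtains the q-Gaussian
distribution [21] defined as

$$p(x) = \frac{\Lambda_q}{\sigma\sqrt{3-q}} \exp_q\left(-\frac{(x-\mu)^2}{(3-q)\sigma^2}\right), \tag{6}$$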

where $\Lambda_q$ is the normalizing constant [21]. However, if the
constraint

$$\langle |x| \rangle_q := \frac{\int_{\mathbb{R}} |x| \, (p(x))^q \, dx}{\int_{\mathbb{R}} (p(x))^q \, dx} = \beta \tag{7}$$

is considered instead, then maximization of Tsallis entropy with only
this constraint leads to a q-variant of the doubly exponential
or Laplace distribution centered at zero. A translated version
of the distribution can be written as

$$p(x) = \frac{1}{2\beta} \exp_q\left(-\frac{|x-\mu|}{(2-q)\beta}\right). \tag{8}$$



As $q \to 1$, we retrieve the exponential, Gaussian and
Laplace distributions as special cases of (3), (6) and (8),
respectively. The above distributions can be extended to a
multi-dimensional setting in a way similar to Gaussian and
Laplacian distributions, by incorporating the 2-norm and 1-norm
in (6) and (8), respectively.
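The q-exponential (4) and the q-Laplace density (8) can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `q_exp`, `q_laplace_pdf` and `integrate_half_line` are illustrative and not from the paper. It checks, for one choice of $q$ and $\beta$, that (8) integrates to one and that the constraint (7) returns $\beta$.

```python
import numpy as np

def q_exp(z, q):
    """q-exponential (1 + (1 - q) z)_+^(1 / (1 - q)); ordinary exp when q = 1."""
    z = np.asarray(z, dtype=float)
    if np.isclose(q, 1.0):
        return np.exp(z)
    base = np.maximum(1.0 + (1.0 - q) * z, 0.0)  # Tsallis cut-off
    with np.errstate(divide="ignore"):
        return base ** (1.0 / (1.0 - q))

def q_laplace_pdf(x, q, beta, mu=0.0):
    """q-Laplace density of equation (8)."""
    return q_exp(-np.abs(x - mu) / ((2.0 - q) * beta), q) / (2.0 * beta)

def integrate_half_line(f, n=200_000):
    """Midpoint rule for the integral of f over (0, inf), using x = u / (1 - u)."""
    u = (np.arange(n) + 0.5) / n
    x = u / (1.0 - u)
    return np.sum(f(x) / (1.0 - u) ** 2) / n

q, beta = 1.5, 1.0  # illustrative values with 1 < q < 2
# The density is symmetric about mu = 0, so integrate over the half line.
mass = 2.0 * integrate_half_line(lambda x: q_laplace_pdf(x, q, beta))
num = integrate_half_line(lambda x: x * q_laplace_pdf(x, q, beta) ** q)
den = integrate_half_line(lambda x: q_laplace_pdf(x, q, beta) ** q)
q_mean_abs = num / den  # left-hand side of constraint (7)
print(mass, q_mean_abs)
```

The substitution $x = u/(1-u)$ is only one convenient way to handle the heavy tail; any quadrature suited to slowly decaying integrands would serve the same purpose.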

III. Proposed Kernels 

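Based on the above discussion, we define the q-Gaussian
kernel $G_q : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, for a given $q \in \mathbb{R}$, as

$$G_q(x, y) = \exp_q\left(-\frac{\|x - y\|_2^2}{(3-q)\sigma^2}\right) \quad \text{for all } x, y \in \mathcal{X}, \tag{9}$$

where $\mathcal{X} \subset \mathbb{R}^N$ is the input space, and $q, \sigma \in \mathbb{R}$ are two
parameters controlling the behavior of the kernel, satisfying
the conditions $q \neq 1$, $q \neq 3$ and $\sigma \neq 0$. For $1 < q < 3$, the
term inside the bracket is non-negative and hence, the kernel
is of the form

$$G_q(x, y) = \left(1 + \frac{(q-1)}{(3-q)\sigma^2} \|x - y\|_2^2\right)^{\frac{1}{1-q}}, \tag{10}$$

where $\|\cdot\|_2$ is the Euclidean norm. On similar lines, we use (8)
to define the q-Laplacian kernel $L_q : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$

$$L_q(x, y) = \exp_q\left(-\frac{\|x - y\|_1}{(2-q)\beta}\right) \quad \text{for all } x, y \in \mathcal{X}, \tag{11}$$

where $\|\cdot\|_1$ is the 1-norm, and $q, \beta \in \mathbb{R}$ satisfy the conditions
$q \neq 1$, $q \neq 2$ and $\beta > 0$. As before, for $1 < q < 2$, the kernel
can be written as

$$L_q(x, y) = \left(1 + \frac{(q-1)}{(2-q)\beta} \|x - y\|_1\right)^{\frac{1}{1-q}}. \tag{12}$$
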

Due to the power-law tail of the Tsallis distributions for 
q > 1, in case of the above kernels, similarity decreases 
at a slower rate than the Gaussian and Laplacian kernels 
with increasing distance. The rate of decrease in similarity is 
controlled by the parameter q, and leads to better performance 
in some machine learning tasks, as shown later. Figure 1 shows
how the similarity decays for both q-Gaussian and q-Laplacian
kernels in the one-dimensional case. It can be seen that as q 
increases, the initial decay becomes more rapid, while towards 
the tails, the decay becomes slower. 
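The decay behavior described above can be reproduced with a few lines of code. This is a minimal sketch of the kernels (10) and (12), assuming NumPy; the function names and the sample distances are illustrative choices, not part of the paper.

```python
import numpy as np

def q_gaussian_kernel(X, Y, q, sigma):
    """Gram matrix of the q-Gaussian kernel (10); rows of X and Y are points, 1 < q < 3."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared 2-norm
    return (1.0 + (q - 1.0) * sq / ((3.0 - q) * sigma**2)) ** (1.0 / (1.0 - q))

def q_laplacian_kernel(X, Y, q, beta):
    """Gram matrix of the q-Laplacian kernel (12); rows of X and Y are points, 1 < q < 2."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    l1 = np.sum(np.abs(X[:, None, :] - Y[None, :, :]), axis=-1)  # 1-norm
    return (1.0 + (q - 1.0) * l1 / ((2.0 - q) * beta)) ** (1.0 / (1.0 - q))

# One-dimensional points at distance 0, 1 and 5 from the origin.
dist = np.array([[0.0], [1.0], [5.0]])
origin = np.zeros((1, 1))

gauss = np.exp(-dist[:, 0] ** 2 / 2.0)                      # Gaussian kernel, sigma = 1
heavy = q_gaussian_kernel(dist, origin, q=2.0, sigma=1.0)[:, 0]
lap_heavy = q_laplacian_kernel(dist, origin, q=1.5, beta=1.0)[:, 0]
print(gauss, heavy, lap_heavy)
```

With these values the q-Gaussian similarity at distance 1 is smaller than the Gaussian one, while at distance 5 it is several orders of magnitude larger, which is the faster initial decay and slower tail decay noted in the text.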

We now show that for certain values of $q$, the proposed
kernels satisfy the property of positive definiteness, which is
essential for them to be useful in learning theory. Berg et
al. [5] have shown that for any symmetric kernel function
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, $\mathcal{H}$ being
a higher dimensional space, such that $K(x, y) = \Phi(x)^T \Phi(y)$
for all $x, y \in \mathcal{X}$ if and only if $K$ is positive definite (p.d.),
i.e., given any set of points $\{x_1, x_2, \ldots, x_n\} \subset \mathcal{X}$, the $n \times n$
matrix $\mathbf{K}$, such that $\mathbf{K}_{ij} = K(x_i, x_j)$, is positive semi-definite.

Fig. 1. Example plots for (a) q-Gaussian and (b) q-Laplacian kernels with
$\sigma = \beta = 1$.

We first state some of the results presented in [5], which are
required to prove positive definiteness of the proposed kernels.

Lemma 1. For a p.d. kernel $\varphi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $\varphi \geq 0$, the
following conditions are equivalent:

1) $-\log \varphi$ is negative definite (n.d.), and

2) $\varphi^t$ is p.d. for all $t > 0$.

Lemma 2. Let $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a n.d. kernel, which is
strictly positive, then $\frac{1}{\psi}$ is p.d.

We state the following proposition, which is a general result
providing a method to generate p.d. power-law kernels, given
that their exponential counterpart is p.d.

Proposition 3. Given a p.d. kernel $\varphi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ of the
form $\varphi(x, y) = \exp(-f(x, y))$, where $f(x, y) \geq 0$ for all
$x, y \in \mathcal{X}$, the kernel $\phi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by

$$\phi(x, y) = \big(1 + c f(x, y)\big)^k, \quad \text{for all } x, y \in \mathcal{X}, \tag{13}$$

is p.d., provided the constants $c$ and $k$ satisfy the conditions
$c > 0$ and $k < 0$.

Proof: Since $\varphi$ is p.d., it follows from Lemma 1 that the
kernel $f = -\log \varphi$ is n.d. Thus, for any $c > 0$, $(1 + cf)$ is
n.d., and as $f \geq 0$, we can say $(1 + cf)$ is strictly positive.
So, application of Lemma 2 leads to the fact that $\frac{1}{1 + cf}$ is
p.d. Finally, using Lemma 1 we can claim $(1 + cf)^k$ is p.d.
for all $c > 0$ and $k < 0$. ■
From Proposition 3 and positive definiteness of the Gaussian
and Laplacian kernels, we can show that the proposed
q-Gaussian and q-Laplacian kernels are p.d. for certain ranges
of $q$. Strikingly, it turns out that over these ranges, the
kernels exhibit power-law behavior.

Corollary 4. For $1 < q < 3$, the q-Gaussian kernel, as defined
in (10), is positive definite.

Corollary 5. For $1 < q < 2$, the q-Laplacian kernel, as
defined in (12), is positive definite for all $\beta > 0$.
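Corollaries 4 and 5 can be sanity-checked numerically by inspecting the smallest eigenvalue of a Gram matrix. The sketch below assumes NumPy; the random sample, its size and the parameter values are arbitrary illustrative choices, and a numerical check of this kind is of course not a proof.

```python
import numpy as np

def q_gaussian_gram(X, q, sigma):
    """Gram matrix of kernel (10) on the rows of X, for 1 < q < 3."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (1.0 + (q - 1.0) * sq / ((3.0 - q) * sigma**2)) ** (1.0 / (1.0 - q))

def q_laplacian_gram(X, q, beta):
    """Gram matrix of kernel (12) on the rows of X, for 1 < q < 2."""
    l1 = np.sum(np.abs(X[:, None, :] - X[None, :, :]), axis=-1)
    return (1.0 + (q - 1.0) * l1 / ((2.0 - q) * beta)) ** (1.0 / (1.0 - q))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))  # 40 arbitrary points in R^3

# Smallest eigenvalue of each symmetric Gram matrix; a p.d. kernel
# should give values that are non-negative up to rounding error.
min_eig_gauss = {q: float(np.linalg.eigvalsh(q_gaussian_gram(X, q, 1.0)).min())
                 for q in (1.25, 2.0, 2.75)}
min_eig_lap = {q: float(np.linalg.eigvalsh(q_laplacian_gram(X, q, 1.0)).min())
               for q in (1.25, 1.5, 1.75)}
print(min_eig_gauss, min_eig_lap)
```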

Now, we show that some of the popular kernels can be
obtained as special cases of the proposed kernels. The Gaussian
kernel is defined as

$$\psi_1(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right), \tag{14}$$

where $\sigma \in \mathbb{R}$, $\sigma > 0$. We can retrieve the Gaussian kernel (14)
when $q \to 1$ in the q-Gaussian kernel (10). The Rational
Quadratic kernel is of the form

$$\psi_2(x, y) = 1 - \frac{\|x - y\|_2^2}{\|x - y\|_2^2 + c}, \tag{15}$$

where $c \in \mathbb{R}$, $c > 0$. Substituting $q = 2$ in (10), we obtain (15)
with $c = \sigma^2$. The Laplacian kernel is defined as

$$\psi_3(x, y) = \exp\left(-\frac{\|x - y\|_1}{\sigma}\right), \tag{16}$$

where $\sigma \in \mathbb{R}$, $\sigma > 0$. We can retrieve (16) as $q \to 1$ in the
q-Laplacian kernel (12), with $\beta = \sigma$.
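The two reductions of the q-Gaussian kernel can be written out explicitly. The short derivation below is a sketch that uses only the definition (10) and the standard limit $\lim_{q \to 1} (1 + (1-q)z)^{1/(1-q)} = e^z$; the shorthand $d$ is introduced here for readability.

```latex
% Shorthand: d = \|x - y\|_2.
% Limit q -> 1 of (10): the exponent argument tends to -d^2 / (2 \sigma^2).
\lim_{q \to 1} G_q(x, y)
  = \lim_{q \to 1} \left(1 + (1 - q)\left(-\frac{d^2}{(3 - q)\sigma^2}\right)\right)^{\frac{1}{1-q}}
  = \exp\left(-\frac{d^2}{2\sigma^2}\right) = \psi_1(x, y).

% Case q = 2 of (10): the exponent 1/(1-q) equals -1.
G_2(x, y) = \left(1 + \frac{d^2}{\sigma^2}\right)^{-1}
  = \frac{\sigma^2}{\sigma^2 + d^2}
  = 1 - \frac{d^2}{d^2 + \sigma^2},
\qquad \text{i.e. } \psi_2 \text{ with } c = \sigma^2.
```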

IV. Reproducing Kernel Hilbert Space 

As discussed earlier, kernels map the data points to a
higher dimensional feature space, also called the Reproducing
Kernel Hilbert Space (RKHS), that is unique for each positive
definite kernel [3]. The significance of the RKHS for support
vector kernels using Bochner's theorem [6], which provides
an RKHS in Fourier space for translation invariant kernels, is
stated in [26]. Other approaches also exist that lead to an explicit
description of the RKHS of the Gaussian kernel [27], but such an
approach does not work for the proposed kernels, as the Taylor series
expansion of the q-exponential function (4) does not converge on its
whole domain for $q > 1$. So, we follow Bochner's approach.

We state Bochner's theorem, and then use the method
presented in [16] to show how it can be used to construct
the RKHS for a p.d. kernel.

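Theorem 6 (Bochner). A continuous kernel $\varphi(x, y) = \varphi(x - y)$
on $\mathbb{R}^d$ is positive definite if and only if $\varphi(t)$ is the Fourier
transform of a non-negative measure, i.e., there exists $\rho \geq 0$
such that $\rho(\omega)$ is the inverse Fourier transform of $\varphi(t)$.

Then, the RKHS of the kernel $\varphi$ is given by

$$\mathcal{H}_\varphi = \left\{ f \in L_2(\mathbb{R}) : \int \frac{|\hat{f}(\omega)|^2}{\rho(\omega)} \, d\omega < \infty \right\}, \tag{17}$$

with the inner product defined as

$$\langle f, g \rangle_\varphi = \int \frac{\hat{f}(\omega) \, \overline{\hat{g}(\omega)}}{\rho(\omega)} \, d\omega, \tag{18}$$

where $\hat{f}(\omega)$ is the Fourier transform of $f(t)$ and $L_2(\mathbb{R})$ is the set
of all functions on $\mathbb{R}$, square integrable with respect to the
Lebesgue measure.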

It must be noted here that in our case, the existence and
non-negativity of the inverse Fourier transform $\rho$ is immediate due to
the positive definiteness of the proposed kernels (Corollaries 4
and 5). Hence, to describe the RKHS it is enough to determine
an expression for $\rho$ for both the kernels. We define the
functions corresponding to the q-Gaussian and q-Laplacian
kernels, respectively, as

$$\varphi_G(t) = \left(1 + \frac{(q-1)}{(3-q)\sigma^2} \|t\|_2^2\right)^{\frac{1}{1-q}}, \quad q \in (1, 3), \tag{19}$$

$$\varphi_L(t) = \left(1 + \frac{(q-1)}{(2-q)\beta} \|t\|_1\right)^{\frac{1}{1-q}}, \quad q \in (1, 2), \tag{20}$$

where $\beta, \sigma \in \mathbb{R}$, $\beta > 0$ and $t = (t_1, \ldots, t_N) \in \mathbb{R}^N$. We derive
expressions for their inverse Fourier transforms $\rho_G(\omega)$ and
$\rho_L(\omega)$, respectively. For this, we require a technical result [12,
Eq. 4.638(3)], which is stated in the following lemma.

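Lemma 7. Let $s \in (0, \infty)$ and $p_i, q_i, r_i \in (0, \infty)$ for $i =
1, 2, \ldots, N$ be constants, then the N-dimensional integral

$$\int_0^\infty \cdots \int_0^\infty \frac{\prod_{i=1}^N x_i^{p_i - 1}}{\left(1 + \sum_{i=1}^N (r_i x_i)^{q_i}\right)^s} \, dx_1 dx_2 \ldots dx_N = \frac{\Gamma\left(s - \sum_{i=1}^N \frac{p_i}{q_i}\right)}{\Gamma(s)} \prod_{i=1}^N \frac{\Gamma\left(\frac{p_i}{q_i}\right)}{q_i \, r_i^{p_i}}.$$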
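The one-dimensional case of Lemma 7 is easy to verify numerically. The sketch below assumes NumPy and restricts itself to parameter values with $s > p/q$, for which the integral converges; the function names are illustrative, and the argument `q_pow` denotes the lemma's exponent $q_1$, not the entropy index $q$.

```python
import math
import numpy as np

def lemma7_closed_form(p, q_pow, r, s):
    """Right-hand side of Lemma 7 for N = 1."""
    ratio = p / q_pow
    return math.gamma(s - ratio) * math.gamma(ratio) / (math.gamma(s) * q_pow * r**p)

def lemma7_numeric(p, q_pow, r, s, n=200_000):
    """Left-hand side of Lemma 7 for N = 1, midpoint rule after x = u / (1 - u)."""
    u = (np.arange(n) + 0.5) / n
    x = u / (1.0 - u)
    integrand = x ** (p - 1.0) / (1.0 + (r * x) ** q_pow) ** s
    return float(np.sum(integrand / (1.0 - u) ** 2) / n)

# Two illustrative parameter sets with s > p / q_pow.
case_a = (lemma7_numeric(1.0, 2.0, 1.0, 1.0), lemma7_closed_form(1.0, 2.0, 1.0, 1.0))
case_b = (lemma7_numeric(3.0, 2.0, 2.0, 3.0), lemma7_closed_form(3.0, 2.0, 2.0, 3.0))
print(case_a, case_b)
```

The first case is the familiar integral of $1/(1 + x^2)$ over the half line, which equals $\pi/2$.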



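We now derive the inverse Fourier transforms. We prove the
result for Proposition 8. The proof of Proposition 9 proceeds
similarly.

Proposition 8. The inverse Fourier transform for $\varphi_G(t)$ is
given by

$$\rho_G(\omega) = \frac{1}{\Gamma\left(\frac{1}{q-1}\right)} \left(\frac{(3-q)\sigma^2}{2(q-1)}\right)^{N/2} \sum_{b=0}^{\infty} \frac{(-1)^b}{b!} \, \Gamma\left(\frac{1}{q-1} - \frac{N}{2} - b\right) \left(\frac{(3-q)\sigma^2 \|\omega\|_2^2}{4(q-1)}\right)^b. \tag{21}$$

Proof: By definition,

$$\rho_G(\omega) = (2\pi)^{-N/2} \int_{\mathbb{R}^N} \exp(i \, t \cdot \omega) \, \varphi_G(t) \, dt. \tag{22}$$

Expanding the exponential term, we have

$$\exp(i \, t \cdot \omega) = \prod_{j=1}^N \big(\cos(t_j \omega_j) + i \sin(t_j \omega_j)\big).$$

Since both $\cos(t_j \omega_j)$ and $\varphi_G(t)$ are even functions of every
$t_j$, while $\sin(t_j \omega_j)$ is an odd function, on integrating over
$\mathbb{R}^N$, all terms with a sine component become zero. Further, the
remaining term is even, and hence, the integral is the same in every
orthant. So the expression reduces to

$$\rho_G(\omega) = \frac{2^N}{(2\pi)^{N/2}} \int_0^\infty \cdots \int_0^\infty \left(1 + c \sum_{j=1}^N t_j^2\right)^{\frac{1}{1-q}} \prod_{j=1}^N \cos(t_j \omega_j) \, dt_1 \ldots dt_N, \tag{23}$$

where $c = \frac{(q-1)}{(3-q)\sigma^2}$. Each of the cosine terms can be expanded
in the form of an infinite series as

$$\cos(t_j \omega_j) = \sum_{m_j = 0}^{\infty} (-1)^{m_j} \frac{t_j^{2m_j} \omega_j^{2m_j}}{(2m_j)!}.$$

Substituting in (23) and using Lemma 7, we obtain

$$\rho_G(\omega) = \frac{1}{(2\pi c)^{N/2} \, \Gamma\left(\frac{1}{q-1}\right)} \sum_{m_1, \ldots, m_N = 0}^{\infty} \left(-\frac{1}{c}\right)^{\sum_{j=1}^N m_j} \Gamma\left(\frac{1}{q-1} - \frac{N}{2} - \sum_{j=1}^N m_j\right) g(\omega), \tag{24}$$

where

$$g(\omega) = \prod_{j=1}^N \frac{\omega_j^{2m_j} \, \Gamma\left(m_j + \frac{1}{2}\right)}{(2m_j)!}. \tag{25}$$

Using the expansion of the gamma function for half integers, we can
write (25) as

$$g(\omega) = \pi^{N/2} \prod_{j=1}^N \frac{\omega_j^{2m_j}}{4^{m_j} \, m_j!}. \tag{26}$$

Substituting in (24) and using $b = \sum_{j=1}^N m_j$, we have

$$\rho_G(\omega) = \frac{1}{(\sqrt{2c})^N \, \Gamma\left(\frac{1}{q-1}\right)} \sum_{b=0}^{\infty} \frac{(-1)^b}{(4c)^b} \, \Gamma\left(\frac{1}{q-1} - \frac{N}{2} - b\right) \sum_{\substack{m_1, \ldots, m_N \geq 0 \\ \sum_{k=1}^N m_k = b}} \frac{\omega_1^{2m_1} \cdots \omega_N^{2m_N}}{m_1! \cdots m_N!}. \tag{27}$$

We arrive at the claim by observing that the terms in the
inner summation in (27) are the terms of the multinomial
expansion of $\frac{1}{b!} (\omega_1^2 + \ldots + \omega_N^2)^b$, and substituting the value of $c$. ■

It can be observed that the above result agrees with the fact
that the inverse Fourier transform of a radial function is radial in
nature. We present the corresponding result for the q-Laplacian kernel.

Proposition 9. The inverse Fourier transform for $\varphi_L(t)$ is
given by

$$\rho_L(\omega) = \frac{1}{\Gamma\left(\frac{1}{q-1}\right)} \left(\sqrt{\frac{2}{\pi}} \, \frac{(2-q)\beta}{(q-1)}\right)^N \sum_{b=0}^{\infty} (-1)^b \, \Gamma\left(\frac{1}{q-1} - N - 2b\right) \left(\frac{(2-q)\beta}{(q-1)}\right)^{2b} h_b(\omega), \tag{28}$$

where

$$h_b(\omega) = \sum_{\substack{m_1, \ldots, m_N \geq 0 \\ \sum_{k=1}^N m_k = b}} \omega_1^{2m_1} \omega_2^{2m_2} \cdots \omega_N^{2m_N}, \tag{29}$$

with $\omega_1, \ldots, \omega_N$ being the components of $\omega$.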
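As a concrete instance of the non-negativity guaranteed by Bochner's theorem, the one-dimensional q-Gaussian kernel with $q = 2$ admits a closed-form spectral density. The derivation below is a sketch computed directly from the definition (22) with $N = 1$, using the standard integral $\int_{\mathbb{R}} e^{iuk} / (1 + u^2) \, du = \pi e^{-|k|}$; it does not rely on the series expansion.

```latex
% q = 2, N = 1, sigma > 0: the kernel profile is \varphi_G(t) = (1 + t^2 / \sigma^2)^{-1}.
\rho_G(\omega)
  = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{e^{i t \omega}}{1 + t^2 / \sigma^2} \, dt
  = \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{e^{i u \sigma \omega}}{1 + u^2} \, du
  \qquad (t = \sigma u)
  = \sigma \sqrt{\frac{\pi}{2}} \; e^{-\sigma |\omega|} \; > 0 .
```

The density is strictly positive and depends on $\omega$ only through $|\omega|$, as expected for a positive definite radial kernel.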



V. Performance Comparison 

In this section, we apply the q-Gaussian and q-Laplacian
kernels in classification and regression. We provide insights
into the behavior of these kernels through examples. We also
compare the performance of the kernels for different values
of $q$, and also with the Gaussian, Laplacian (i.e., when $q \to 1$),
and polynomial kernels using various data sets from the UCI
repository [10]. The simulations have been performed using
LIBSVM [7]. Table I provides a description of the data sets
used. Data sets 11 to 13 have been used for regression.



No.  Data Set             Class  Attribute  Instance
1    Acute Inflammations  2      6          120
2    Australian Credit*   2      14         690
3    Blood Transfusion    2      4          748
4    Breast Cancer*       2      9          699
5    Iris                 3      4          150
6    Mammographic Mass    2      5          830
7    Statlog (Heart)*     2      13         270
8    Tic-Tac-Toe          2      9          958
9    Vertebral Column     3      6          310
10   Wine*                3      13         178
11   Auto MPG             -      8          398
12   Servo                -      4          167
13   Wine Quality (red)   -      12         1599

TABLE I

Data Sets (sets marked * have been normalized).



A. Kernel SVM 

Support Vector Machines (SVMs) are one of the most
important classes of kernel machines. While linear SVMs, using
the inner product as similarity measure, are quite common, other
variants using various kernel functions, mostly Gaussian, are
also used in practice. Use of kernels leads to non-linear
separating hyperplanes, which sometimes provide better
classification. Now, we formulate an SVM based on the proposed kernels.
For the q-Gaussian kernel (10), it leads to an optimization
problem with the following dual form:

$$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \exp_q\left(-\frac{\|x_i - x_j\|_2^2}{(3-q)\sigma^2}\right) - \sum_{i=1}^n \alpha_i$$

subject to $\alpha_i \geq 0$, $i = 1, \ldots, n$, and $\sum_{i=1}^n \alpha_i y_i = 0$,

where $\{x_1, \ldots, x_n\} \subset \mathcal{X}$ are the training data points and
$\{y_1, \ldots, y_n\} \subset \{-1, 1\}$ are the true classes. The optimization
problem for the q-Laplacian kernel (12) can be formulated by
using $\exp_q\left(-\frac{\|x_i - x_j\|_1}{(2-q)\beta}\right)$ in the above expression.
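The dual objective above is straightforward to evaluate once the Gram matrix is available. The following is a minimal sketch, assuming NumPy; it only evaluates the objective and checks feasibility for a given $\alpha$, and does not solve the quadratic program (the paper's experiments use LIBSVM for that). The function names and the toy data are hypothetical.

```python
import numpy as np

def q_gaussian_gram(X, q, sigma):
    """Gram matrix of kernel (10) on the rows of X, for 1 < q < 3."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (1.0 + (q - 1.0) * sq / ((3.0 - q) * sigma**2)) ** (1.0 / (1.0 - q))

def svm_dual_objective(alpha, X, y, q, sigma):
    """Value of 0.5 * sum_ij a_i a_j y_i y_j K_ij - sum_i a_i."""
    K = q_gaussian_gram(X, q, sigma)
    ay = alpha * y
    return float(0.5 * ay @ K @ ay - alpha.sum())

def is_dual_feasible(alpha, y, tol=1e-9):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0, up to a tolerance."""
    return bool(np.all(alpha >= -tol) and abs(float(alpha @ y)) <= tol)

# Toy problem: two points on the line with opposite labels.
X = np.array([[0.0], [1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])

value = svm_dual_objective(alpha, X, y, q=2.0, sigma=1.0)
feasible = is_dual_feasible(alpha, y)
print(value, feasible)
```

A Gram matrix built this way can also be handed to any SVM solver that accepts a precomputed kernel.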

The two-dimensional example in Figure 2 illustrates the
nature of hyperplanes that can be obtained using various kernels.
The decision boundaries are more flexible for q-Laplacian
and q-Gaussian kernels. Further, viewing the Laplacian and
Gaussian kernels as special cases ($q \to 1$), it can be said
that an increase in the value of $q$ leads to more flexibility of the
decision boundaries.

We compare the performance of the proposed kernels with 
Gaussian and Laplacian kernel SVMs for various values of q. 




Fig. 2. Decision boundaries using (a) Gaussian, (b) q-Gaussian (q = 2.95),
(c) Laplacian, and (d) q-Laplacian (q = 1.95) kernel SVMs.



The results of 5-fold cross validation using multiclass SVM
are shown in Table II. Further, the power-law nature reminds
practitioners of the popular polynomial kernels

$$P_d(x, y) = (x^T y + c)^d \quad \text{for } x, y \in \mathbb{R}^N,$$

where the parameters $c \in [0, \infty)$ and $d \in \mathbb{N}$. Hence, we also
provide the accuracies obtained using these kernels. We fix
a particular $\sigma$ for each data set, and set $\beta = \sigma^2$.
For polynomial kernels, we consider $c = 0$, while $d$
is varied. The best values of $q$ among all Gaussian type and
Laplacian type kernels are marked for each data set. In case
of the polynomial kernels, we only mark those cases where
the best result among these kernels is better than or comparable
with the best cases of Gaussian or Laplacian types. We note
here that in the simulations, the polynomial kernels required
normalization of a few other data sets as well.

The results indicate the significance of tuning the parameter
$q$. For most cases, the q-Gaussian and q-Laplacian kernels
tend to perform better than their exponential counterparts. This
can be justified by the flexibility of the separating hyperplane
achieved. However, it has been observed (not demonstrated
here) that for very high or very low values of $\sigma$ (or $\beta$),
the kernels give similar results, which happens because the
power-law and the exponential natures cannot be distinguished
in such cases. The polynomial kernels, though sometimes
comparable to the proposed kernels, rarely improve upon the
performance of the power-law kernels.

Data set                    1      2      3      4      5      6      7      8      9      10
Parameter (σ = √β)          10     15     5      5      2      25     5      1.5    50     1
Gaussian (q → 1)            86.67  76.96  77.27  96.63  97.33  79.28  82.96  89.46  87.10  97.19
q-Gaussian, q = 1.25        [?]    [?]    77.14  [?]    [?]    [?]    [?]    [?]    87.10  [?]
q-Gaussian, q = 1.50        [?]    [?]    [?]    [?]    [?]    [?]    [?]    [?]    87.10  [?]
q-Gaussian, q = 1.75        88.33  86.38  76.60  96.93  98.00  79.76  83.33  88.94  86.45  97.75
q-Gaussian, q = 2.00        88.33  85.80  76.74  96.93  98.00  79.88  83.33  88.62  86.13  97.75
q-Gaussian, q = 2.25        89.17  85.51  76.60  96.93  98.00  79.40  83.33  87.68  85.48  98.31
q-Gaussian, q = 2.50        91.67  85.51  76.34  96.93  97.33  80.00  84.07  85.49  85.48  98.31
q-Gaussian, q = 2.75        98.33  85.51  76.47  97.22  96.67  80.48  84.07  84.34  85.16  98.31
q-Gaussian, q = 2.95        100    85.51  75.53  97.22  96.67  80.12  82.22  75.99  85.16  97.75
Laplacian (q → 1)           93.33  86.23  77.81  97.07  96.67  81.69  83.70  94.89  76.45  98.88
q-Laplacian, q = 1.25       95.83  85.51  77.94  97.07  96.67  81.57  83.70  92.80  77.42  98.88
q-Laplacian, q = 1.50       97.50  85.51  77.27  97.07  96.67  81.81  83.70  89.67  77.10  98.88
q-Laplacian, q = 1.75       100    85.51  77.14  97.51  96.67  82.29  83.33  84.55  78.39  98.88
q-Laplacian, q = 1.95       100    85.51  75.67  97.80  96.00  83.73  82.96  71.09  86.77  95.51
Polynomial, d = 1 (linear)  100    85.51  72.86  97.07  98.00  82.17  83.70  65.34  85.16  97.19
Polynomial, d = 2           100    85.22  76.47  96.19  96.67  83.86  80.37  86.53  76.77  96.63
Polynomial, d = 5           100    80.72  76.47  95.61  95.33  83.61  74.81  94.15  64.84  94.94
Polynomial, d = 10          100    76.23  76.47  94.00  94.67  81.69  74.81  88.73  59.03  93.26

TABLE II

Percentage of correct classification in kernel SVM using 5-fold cross validation
([?] marks an entry that is illegible in the source).

B. Kernel Regression

In linear basis function models for regression, given a
set of data points, the output function is approximated as a
linear combination of fixed non-linear functions as $f(x) \approx
w_0 + \sum_{j=1}^M w_j \phi_j(x)$, where $\{\phi_1(\cdot), \ldots, \phi_M(\cdot)\}$ are the basis
functions, usually chosen as $\phi_j(x) = \psi(x, x_j)$, $x_1, \ldots, x_M$
being the given data points, and $\psi$ a p.d. kernel. The constants
$\{w_0, w_1, \ldots, w_M\}$ are obtained by minimizing the least squared
error. Such an optimization can be formulated as an ε-Support
Vector type problem [25]. The kernels defined in (10) and (12)
can also be used in this case as shown in the example in
Figure 3, where ε-SV regression is used to reconstruct a sine
wave from 20 uniformly spaced sampled data points in $[0, \pi]$.
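The linear basis function model described above can be sketched with an ordinary least-squares fit. Note that this is plain least squares on kernel basis functions, not the ε-SV regression used for the paper's experiments; it assumes NumPy, and the values $q = 2$ and $\sigma = 1$ as well as the helper names are illustrative choices.

```python
import numpy as np

def q_gaussian_kernel_1d(a, b, q, sigma):
    """Kernel (10) between two 1-D arrays of scalar inputs, for 1 < q < 3."""
    sq = (a[:, None] - b[None, :]) ** 2
    return (1.0 + (q - 1.0) * sq / ((3.0 - q) * sigma**2)) ** (1.0 / (1.0 - q))

def design_matrix(x, centers, q, sigma):
    """Columns: constant term for w_0, then phi_j(x) = kernel(x, x_j)."""
    return np.hstack([np.ones((x.size, 1)), q_gaussian_kernel_1d(x, centers, q, sigma)])

# 20 uniformly spaced samples of a sine wave on [0, pi].
x_train = np.linspace(0.0, np.pi, 20)
y_train = np.sin(x_train)

q, sigma = 2.0, 1.0
# Least-squares weights (minimum-norm solution, since there are 21 unknowns).
w, *_ = np.linalg.lstsq(design_matrix(x_train, x_train, q, sigma), y_train, rcond=None)

x_test = np.linspace(0.0, np.pi, 200)
y_pred = design_matrix(x_test, x_train, q, sigma) @ w
train_err = float(np.max(np.abs(design_matrix(x_train, x_train, q, sigma) @ w - y_train)))
test_err = float(np.max(np.abs(y_pred - np.sin(x_test))))
print(train_err, test_err)
```

An ε-insensitive loss with regularization, as in ε-SVR, would typically give a sparser and better conditioned solution than this unregularized fit.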




Fig. 3. Sine curve obtained by ε-SVR using Gaussian, q-Gaussian (q =
2.95), Laplacian and q-Laplacian (q = 1.95) kernels with σ = √β = 2 and
ε = 0.01.



The performance of the proposed kernels has been
compared with polynomial, Gaussian and Laplacian kernels for
various values of $q$ using data sets 11, 12 and 13. The results
of 5-fold cross validation using ε-SVR (ε = 0.1) are shown in
Table III. We fixed a particular $\beta = \sigma^2$ for each data set. Though
the Laplacian kernel outperforms its power-law variants on data
sets 11 and 12, the q-Gaussians improve upon the performance of the
Gaussian kernel for most values of $q$.
The results further indicate that the error is a relatively smooth
function of $q$, and does not have a fluctuating behavior, though
its trend seems to depend on the data. The relative performance
of the polynomial kernels is poor.

Data set                    11       12      13
Parameter (σ = √β)          1        2       10
Gaussian (q → 1)            11.1630  0.9655  0.4916
q-Gaussian, q = 1.25        11.0694  0.9218  0.4883
q-Gaussian, q = 1.50        10.9674  0.9035  0.4853
q-Gaussian, q = 1.75        10.8826  0.8986  0.4823
q-Gaussian, q = 2.00        10.7406  0.9005  0.4781
q-Gaussian, q = 2.25        10.5661  0.9072  0.4734
q-Gaussian, q = 2.50        10.4428  0.9424  0.4661
q-Gaussian, q = 2.75        10.4796  1.0698  0.4595
q-Gaussian, q = 2.95        12.2427  1.5439  0.4419
Laplacian (q → 1)           9.7681   0.5398  0.4298
q-Laplacian, q = 1.25       10.2052  0.5532  0.4223
q-Laplacian, q = 1.50       10.9578  0.6055  0.4123
q-Laplacian, q = 1.75       13.2213  0.7910  0.3961
q-Laplacian, q = 1.95       17.7303  1.6934  0.3784
Polynomial, d = 1 (linear)  13.3765  1.9047  0.4357
Polynomial, d = 2           10.5835  2.2740  0.4268
Polynomial, d = 5           16.8173  2.3305  0.5485
Polynomial, d = 10          52.4609  2.7358  10.5518

TABLE III

Mean Squared Error in kernel SVR.

VI. Conclusion

In this paper, we proposed a power-law generalization of
Gaussian and Laplacian kernels based on Tsallis distributions.
They retain their properties in the classical case as $q \to 1$.
Further, due to their power-law nature, the tails of the
proposed kernels decay at a slower rate than their exponential
counterparts, which in turn broadens the use of these kernels
in learning tasks.

We showed that the proposed kernels are positive definite 
for certain range of q, and presented results pertaining to the 
RKHS of the proposed kernels using Bochner's theorem. We 
also demonstrated the performance of the proposed kernels in 
support vector classification and regression. 

Power-law behavior was recognized long ago in
many problems in the context of statistical analysis. Recently,
power-law distributions have been studied in the machine learning
community. To the best of our knowledge, this is the first paper
that introduces and studies power-law kernels, leading to the
notion of a "fat-tailed kernel machine".